HW 01

Homework 1
Author

Brooke Pacheco

Published

May 28, 2025

0 - Setup

# load packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glue)
library(here)
here() starts at C:/Users/bpach/Documents/repos/INFO_526/hw1/info-526-su25-classroom-hw-01-hw-01-1
library(countdown)
library(ggthemes)
library(gt)
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata

Attaching package: 'openintro'

The following object is masked from 'package:gt':

    sp500
library(ggrepel)
library(patchwork)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 11))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

if (!require("pacman")) 
  install.packages("pacman")
Loading required package: pacman
# use this line for installing/loading
pacman::p_load(tidyverse,
               glue,
               scales,
               ggthemes) 

devtools::install_github("tidyverse/dsbox")
Using github PAT from envvar GITHUB_PAT. Use `gitcreds::gitcreds_set()` and unset GITHUB_PAT in .Renviron (or elsewhere) if you want to use the more secure git credential store instead.
Skipping install of 'dsbox' from a github remote, the SHA1 (244ecdfe) has not changed since last install.
  Use `force = TRUE` to force installation

1 - Road traffic accidents in Edinburgh

# Read in data from accidents file
accidents <- read_csv(here(
  "data" ,"accidents.csv"), show_col_types = FALSE)

# Create a new column - wrangle the data
accidents_wrangle <- accidents |>
  mutate(
    week = case_when(
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend",
      TRUE ~ "Weekday"
    ),
    week = fct_relevel(week, "Weekday", "Weekend")
  ) 

# Create the plot
ggplot(accidents_wrangle, aes(x = time, fill = severity, group = severity)) +
  geom_density(color = "black", alpha = 0.5) +
  scale_fill_manual(values = c("#AA93B0", "#9ECAC8", "#FEF39F")) +
  labs(
    x = "Time of day",
    y = "Density",
    title = "Number of accidents throughout the day",
    subtitle = "By day of the week and severity",
    fill = "Severity"
  ) + 
  facet_wrap(~ week, nrow = 2) +
  theme_minimal(base_size = 11)

This plot shows the density of accidents throughout the day, categorized by severity: fatal, serious, and slight. It appears that weekdays have more fatal accidents and more accidents overall, especially in the afternoon and evening. On weekends, slight and serious accidents tend to occur during the day, with a noticeable increase in the afternoon and evening hours.
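One caveat when reading a density plot: each curve is scaled so that the area under it is about 1, so the height of a curve describes when accidents of a given severity happen rather than how many there are. A minimal sketch of that point in base R, using made-up times (the vectors `few` and `many` and the helper `area_under()` are hypothetical and are not part of the accidents data):

```r
# Hypothetical accident times in hours; NOT taken from accidents.csv
set.seed(526)
few  <- runif(20,   min = 0, max = 24)  # a small group, e.g. a rare severity
many <- runif(2000, min = 0, max = 24)  # a group 100 times larger

# Approximate the area under a kernel density estimate (Riemann sum)
area_under <- function(x) {
  d <- density(x)
  sum(d$y) * mean(diff(d$x))
}

area_few  <- area_under(few)
area_many <- area_under(many)

# Both areas are close to 1 despite the very different group sizes
round(c(few = area_few, many = area_many), 2)

# Counts have to come from a separate tally
group_sizes <- c(few = length(few), many = length(many))
group_sizes
```

Comparing how many accidents fall into each severity level would therefore need a count table or a histogram alongside the density curves.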

2 - NYC marathon winners

2a

# Read in data from nyc marathon file
marathon <- read_csv(here(
  "data" ,"nyc_marathon.csv"), show_col_types = FALSE)

# Remove NA in time hrs columns
marathon <- marathon %>% filter(!is.na(time_hrs))

# Create the histogram plot
ggplot(marathon, aes(x = time_hrs)) +
  geom_histogram(binwidth = 0.1) +
  labs (
    title = "Histogram of all runners in data set",
    x = "Time in hours",
    y = "Count"
  ) +
  theme_minimal(base_size = 11)

# Create the box plot
ggplot(marathon, aes(x = time_hrs)) +
  geom_boxplot(outlier.size = 2) +
  labs (
    title = "Box plot of all runners in data set",
    x = "Time in hours",
  ) +
  theme_minimal(base_size = 11)

The histogram makes it easier to gauge the total number of runners and to understand how many runners finished within each time range. It provides a clear visual representation of the distribution of finish times across the dataset. In contrast, the boxplot does not clearly show the quantity of runners, but it is effective for identifying summary features such as the median finish time and any outliers. The y-axis in the boxplot does not clearly represent what is being measured.
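The difference between the two displays can also be seen numerically: a histogram reports a count per bin, while a box plot is drawn from a five-number summary plus any outliers. A small sketch in base R, assuming a made-up vector of finish times (`times` and the bin edges are hypothetical and are not the marathon data):

```r
# Hypothetical finish times in hours; NOT taken from nyc_marathon.csv
times <- c(2.11, 2.15, 2.18, 2.25, 2.31, 2.45, 2.52, 2.55, 2.63, 3.15)

# What a histogram shows: how many observations fall in each bin
bins <- cut(times, breaks = c(2, 2.5, 3, 3.5))
bin_counts <- table(bins)
bin_counts

# What a box plot shows: minimum, lower hinge, median, upper hinge, maximum
five_num <- fivenum(times)
five_num

# Points more than 1.5 times the hinge spread beyond the hinges
# are drawn as outliers
outliers <- boxplot.stats(times)$out
outliers
```

The bin counts add up to the number of runners, which is the information the box plot leaves out.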

2b

# Create the box plot
ggplot(marathon, aes(x = division, y = time_hrs, fill = division)) +
  geom_boxplot(outlier.size = 2) +
  scale_fill_manual(
    values = c("Men" = "pink", "Women" = "deepskyblue3")
  ) +
  labs (
    title = "Box plot of all runners in data set",
    x = "Division",
    y = "Time in hours",
    fill = "Division"
  ) +
  theme_minimal(base_size = 11)

The boxplot, with the fill parameter set to division (men or women), provides a better visual comparison of performance based on gender. It shows that, in general, men tend to finish before the 2.25-hour mark, while women typically finish closer to or just after the 2.5-hour mark. The outliers for the men’s division extend slightly beyond 2.5 hours, whereas the women’s outliers appear after the 3-hour mark.
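The comparison read off the box plot can be backed up with a grouped summary. A minimal sketch in base R, assuming a made-up data frame (`toy` and its values are hypothetical and are not the marathon data):

```r
# Hypothetical winning times in hours; NOT taken from nyc_marathon.csv
toy <- data.frame(
  division = c("Men", "Men", "Men", "Women", "Women", "Women"),
  time_hrs = c(2.14, 2.18, 2.21, 2.42, 2.47, 2.55)
)

# Median finish time per division (the thick line inside each box)
medians <- tapply(toy$time_hrs, toy$division, median)
medians

# Gap between the two medians, converted from hours to minutes
gap_minutes <- as.numeric(medians["Women"] - medians["Men"]) * 60
gap_minutes
```

Applying the same `tapply()` call to the real `marathon` data would give the medians that the box plot displays.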

2c

# Create the histogram plot
ggplot(marathon, aes(x = time_hrs, fill = division)) +
  geom_histogram(binwidth = 0.1) +
  scale_fill_manual(
    values = c("Men" = "pink", "Women" = "deepskyblue3")
  ) +
  labs (
    title = "Histogram of all runners in data set",
    x = "Time in hours",
    y = "Count",
    fill = "Division"
  ) +
  theme_minimal(base_size = 11)

# # Create the box plot
# ggplot(marathon, aes(x = "", y = time_hrs), color = division) +
#   geom_boxplot(outlier.size = 2) +
#   scale_color_manual(values = c("Men" = "pink", "Women" = "deepskyblue3")) +
#   scale_fill_manual(
#     values = c("Men" = "pink", "Women" = "deepskyblue3")
#   ) +
#   labs (
#     title = "Box plot of all runners in data set",
#     x = "Division",
#     y = "Time in hours",
#     fill = "Division"
#   ) +
#   theme_minimal(base_size = 11)

The separate displays for the men’s and women’s categories are redundant. The histogram effectively captures both men’s and women’s race times in a single view and makes comparing easier. This approach also improves the data-to-ink ratio by reducing unnecessary visual elements.

2d

# Create a new plot
# group = division draws one line per division ("line" is not a ggplot2 aesthetic)
ggplot(marathon, aes(x = year, y = time_hrs, group = division)) +
  geom_line(linewidth = 0.8, color = "cornsilk4") +
  geom_point(aes(shape = division, color = division), size = 4, show.legend = FALSE) +
  # Only the color aesthetic is mapped, so no fill scale is needed
  scale_color_manual(values = c("Men" = "pink", "Women" = "deepskyblue3")) +
  labs(
    title = "Time series plot of all runners in data set",
    x = "Year",
    y = "Time in hours"
  ) +
  theme_minimal(base_size = 11)

This plot provides a clearer understanding of the performance trends in men’s and women’s race times over the years. The different shapes and colors in the scatter plot offer additional details into the race times.

3 - US counties

3a

The code displays a scatter plot of median education versus median household income. It also overlays a box plot of population in 2017 by smoking ban status on the same graph. Although the code runs without errors, it combines two unrelated visualizations on a single plot, which makes the results confusing and difficult to interpret. In addition, both layers use the same color scheme, making it hard to distinguish between the two plot types. Next, the y-axis must represent either income or population, but not both. The visual is misleading. Lastly, there is a category labeled ‘NA’ on the x-axis, from smoking_ban, that doesn’t represent valid data and should be removed or handled.

3b

It is easier to compare poverty levels in the first plot, where the faceting variables are displayed vertically, by row. From this plot, we observe that individuals who are older, more educated, and have lower poverty levels are more likely to be homeowners. This layout makes patterns easier to identify and shows clearer comparisons across groups. In contrast, the version where faceting variables are displayed horizontally, by column, can be misleading. It gives the impression that individuals with only a high school diploma or some college experience are more likely to own homes at a younger age, which is not necessarily supported by the data. This highlights the importance of intentional facet layout. Row-based faceting supports better comparisons when there are multiple groupings.

3c

# Plot A
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_point(aes(x = homeownership, y = poverty)) +
  labs (
    title = "Plot A",
  ) +
  theme_gray(base_size = 11)

# Plot B 
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_point(aes(x = homeownership, y = poverty)) +
  geom_smooth(aes(x = homeownership, y = poverty), color = "blue", se = FALSE) +
  labs (
    title = "Plot B",
  ) +
  theme_gray(base_size = 11)  
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs
= "cs")'

# Plot C 
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_point(aes(x = homeownership, y = poverty)) +
  geom_smooth(aes(x = homeownership, y = poverty, group = metro), color = "green", se = FALSE, show.legend = FALSE) +
  labs (
    title = "Plot C",
  ) +
  theme_gray(base_size = 11) 
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs
= "cs")'

# Plot D
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_smooth(aes(x = homeownership, y = poverty, group = metro), color = "blue", se = FALSE, show.legend = FALSE) +
  geom_point(aes(x = homeownership, y = poverty)) +
  labs (
    title = "Plot D",
  ) +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs
= "cs")'

# Plot E
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  geom_smooth(aes(x = homeownership, y = poverty, linetype = metro), color = "blue", se = FALSE) +
  labs (
    title = "Plot E",
  ) +
  theme_gray(base_size = 11)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs
= "cs")'

# Plot F
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  geom_smooth(aes(x = homeownership, y = poverty, color = metro), se = FALSE) +
  labs (
    title = "Plot F",
  ) +
  theme_gray(base_size = 11) 
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs
= "cs")'

# Plot G
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  geom_smooth(aes(x = homeownership, y = poverty), color = "blue", se = FALSE) +
  labs (
    title = "Plot G",
  ) +
  theme_gray(base_size = 11) 
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs
= "cs")'

# Plot H
ggplot(county %>% filter(!is.na(median_edu))) + 
  geom_point(aes(x = homeownership, y = poverty, color = metro)) +
  labs (
    title = "Plot H",
  ) +
  theme_gray(base_size = 11)

4 - Credit Card Balances

4a

# Read in data from credit file
credit <- read_csv(here(
  "data" ,"credit.csv"), show_col_types = FALSE)

# Create the plot
ggplot(credit) + 
  geom_point(aes(x = income, y = balance, color = student, shape = student), show.legend = FALSE) + 
  geom_smooth(aes(x = income, y = balance, color = student), method = "lm", se = FALSE, show.legend = FALSE) +
  scale_color_manual(
    values = c("Yes" = "#AA93B0", "No" = "#9ECAC8")
  ) +
  scale_shape_manual(
    values = c("circle", "triangle" ) 
  ) +
  labs (
    x = "Income",
    y = "Credit card balance",
    caption = "Source: https://stackoverflow.com/questions/26191833/add-panel-border-to-ggplot2"
  ) +
  scale_x_continuous(labels = label_dollar(suffix = "K")) +
  scale_y_continuous(labels = label_dollar()) +
  facet_grid(
    student ~ married,
    labeller = labeller(
      student = c("Yes" = "student: Yes", "No" = "student: No"),
      married = c("Yes" = "married: Yes", "No" = "married: No")
      ) 
  ) + 
  theme_minimal(base_size = 14) +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5),
    strip.background = element_rect(fill = "gray90", color = "black")
  )
`geom_smooth()` using formula = 'y ~ x'

The general trend shows that individuals with higher incomes tend to carry higher credit card balances. Among lower-income earners, there is greater variability: some have low balances while others carry substantial debt.

Students who are not married typically have the lowest incomes, but still carry relatively high credit card debt. In contrast, married students tend to have slightly higher incomes and lower balances, suggesting more financial stability.

Married individuals who are not students appear to include the highest income earners. Some of them also hold the largest credit card balances and these are the most extreme outliers in the dataset. Meanwhile, unmarried non-students show a more even distribution in both income and credit card balance.
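The trend lines in the plot come from `geom_smooth(method = "lm")`, which fits a separate least-squares line for each group within each panel. A rough sketch of the same idea in base R, assuming made-up data in which balances lie exactly on a line so the slopes are known in advance (`toy`, its coefficients, and the grouping by `student` only are hypothetical simplifications and are not the credit data):

```r
# Hypothetical data; NOT taken from credit.csv. Balances are constructed
# to lie exactly on a line within each group so the fitted slopes are known.
income <- c(20, 40, 60, 80)
toy <- rbind(
  data.frame(student = "No",  income = income, balance = 100 + 5 * income),
  data.frame(student = "Yes", income = income, balance = 400 + 8 * income)
)

# Fit balance ~ income separately for each student group
fits <- lapply(split(toy, toy$student), function(d) lm(balance ~ income, data = d))

# Slope of each fitted line: change in balance per unit of income
slopes <- sapply(fits, function(f) unname(coef(f)["income"]))
slopes
```

A steeper slope in one group means that balance rises faster with income in that group, which is what the relative steepness of the lines in each facet conveys.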

4b

The combination of marital and student status, along with income, provides insight into credit card balance trends. Married individuals who are not students generally have the highest incomes and include outliers with the largest credit card balances. In contrast, unmarried students tend to have the lowest incomes and display a wide range of credit card debt levels.

4c

# Wrangle data - credit card utilization
credit <- credit |>
  mutate(
    utilization = (balance / limit)
  )

# Create the plot
ggplot(credit) + 
  geom_point(aes(x = income, y = utilization, color = student, shape = student), show.legend = FALSE) + 
  geom_smooth(aes(x = income, y = utilization, color = student), method = "lm", se = FALSE, show.legend = FALSE) +
  scale_color_manual(
    values = c("Yes" = "#AA93B0", "No" = "#9ECAC8")
  ) +
  scale_shape_manual(
    values = c("circle", "triangle" ) 
  ) +
  labs (
    x = "Income",
    y = "Credit utilization",
    caption = "Source: https://stackoverflow.com/questions/26191833/add-panel-border-to-ggplot2"
  ) +
  scale_x_continuous(labels = label_dollar(suffix = "K")) +
  scale_y_continuous(labels = label_percent()) +
  facet_grid(
    student ~ married,
    labeller = labeller(
      student = c("Yes" = "student: Yes", "No" = "student: No"),
      married = c("Yes" = "married: Yes", "No" = "married: No")
      ) 
  ) + 
  theme_minimal(base_size = 14) +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5),
    strip.background = element_rect(fill = "gray90", color = "black")
  )
`geom_smooth()` using formula = 'y ~ x'

4d

Unmarried students have the highest credit card utilization relative to their income. Similarly, married students with lower incomes also show high utilization rates. Overall, lower-income individuals tend to use a larger portion of their available credit.

In contrast, while married non-students often carry higher credit card balances, their utilization rates remain relatively low, typically below 20% of their credit limit. This suggests that higher income earners tend to have greater available credit and are less reliant on it, while lower income individuals may have limited credit access and use a higher percentage of what is available to them.
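The distinction between balance and utilization can be made concrete with a small worked example. The sketch below uses the same definition as in 4c (balance divided by limit) on made-up cardholders (`toy` and its values are hypothetical and are not the credit data):

```r
# Hypothetical cardholders; NOT taken from credit.csv
toy <- data.frame(
  balance = c(900, 900, 3000),
  limit   = c(1500, 9000, 15000)
)

# Same definition as in 4c: share of the credit limit currently in use
toy$utilization <- toy$balance / toy$limit

# Express as whole percentages for readability
toy$utilization_pct <- round(100 * toy$utilization)
toy
```

The first two rows carry the same balance but very different utilization, and the third row has the largest balance with a low utilization, which mirrors the pattern described for married non-students.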

Sources

Grid design in question 4 directly borrowed from: https://stackoverflow.com/questions/26191833/add-panel-border-to-ggplot2

5 - Napoleon’s march

# Read in data from napoleon file
napoleon <- read_rds(here(
  "data" ,"napoleon.rds"))

# Create the plot
# This line creates a ggplot using the troop data extracted from the Napoleon RDS file. 
# Within the aesthetics, the x-axis represents longitude, the y-axis represents latitude, 
# each group is plotted as a separate line, the color indicates the direction of troop 
# movement, and the line thickness corresponds to the size of the troops.
ggplot(napoleon$troops, aes(x = long, y = lat, group = group, color = direction, linewidth = survivors)) + 
  # The function geom_path() creates lines by connecting data points, and setting 
  # show.legend = FALSE prevents the corresponding legend from appearing in the plot.
  geom_path(lineend = "round", show.legend = FALSE) + 
  # The function geom_point() draws the cities as points. To make them stand out, the color is set to #DC5b44.
  geom_point(data = napoleon$cities, aes(x = long, y = lat), color = "#DC5b44", inherit.aes = FALSE) +
  # The function geom_text_repel() is similar to geom_text() except text won't overlap. The font size, text style, color 
  # are set and setting show.legend = FALSE prevents the corresponding legend from appearing in the plot. 
  geom_text_repel(data = napoleon$cities, aes(x = long, y = lat, label = city), size = 2, family = "serif", fontface = "bold.italic", color = "#DC5b44", inherit.aes = FALSE) +
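  # scale_linewidth() sets the thickness limits of the lines in the plot - in this case the size of the troops.
  # (The lines are mapped to the linewidth aesthetic, so scale_size() would have no effect here.)
  scale_linewidth(range = c(0.09, 8)) +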
  # Manually setting the colors for the groups mapped to the color aesthetic.
  scale_color_manual(values = c("#DFC17E", "#252523")) +
  # Labels titles in the plot
  labs (
    title = "Recreation of Napoleon's march plot by Charles John Minard",
    subtitle = "Napoleon's 1812 winter retreat from Moscow",
    linewidth = "Survivors",
    color = "Direction"
  ) +
  # Sets the visible limits of the plot without removing data outside the range.
  coord_cartesian(xlim = c(24, 38), ylim = c(53.75, 55.75)) + 
  # Sets light grey background with white lines and text elements set to 11 units.
  theme_gray(base_size = 11) +
  # The height is 0.2 times the width. Plot will appear much wider.
  theme(aspect.ratio = 0.2)

Sources

Directly borrowed code snippet from: https://www.andrewheiss.com/blog/2017/08/10/exploring-minards-1812-plot-with-ggplot2/

To stretch x axis in question 5, code was inspired by: https://ggplot2.tidyverse.org/reference/coord_fixed.html

Used this to solve bug in question 5 - geom_point and geom_text inherited the parents group setting. Code inspired by: https://stackoverflow.com/questions/69817450/r-ggplot2-understanding-the-parameters-of-the-aes-function

From question 5, geom_text_repel parameter inspiration from: http://www.cookbook-r.com/Graphs/Fonts/